Co-authored-by: Joe Zemmels (he/him) <jzemmels@gmail.com>
… add more examples
|
@jzemmels @jeffskwang-usgs this PR is ready for your review. It includes the two stats API endpoints and should mirror how the functions work in R. I'm prioritizing getting the functions in so that they can be used in the current conditions pipeline, but eventually a vignette on how to use them would be nice, too. That can be a separate PR. Thanks for your feedback! |
Nice work! Here's a summary of my review:
- Ran examples in documentation, comparing against what's returned by dR. All looks good
- Ran the unit tests locally, everything passed
- We've been using the por function extensively already in the current conditions pipeline, which I think is evidence enough that the functions work as intended.
I think the only substantive differences between these and the dR functions are the naming convention por_stats vs. stats_por and your inclusion of the expand_percentiles argument. The API output can be a bit confusing depending on the exact settings of computation_type. I'm not sure if there's a precedent from other endpoints for dealing with nested data, so mentioning the subtleties somewhere might be helpful.
measured and the units of measure. A complete list of parameter codes
and associated groupings can be found at
https://help.waterdata.usgs.gov/codes-and-parameters/parameters.
expand_percentiles : boolean
May be helpful to also mention that setting expand_percentiles = False and requesting 'percentiles' and one of ['median', 'minimum', 'maximum', 'arithmetic_mean'] will return a value and values column, whereas expand_percentiles = True will consolidate these columns into a single value column. Requesting just 'percentiles' and expand_percentiles = False will return just a values column. There's probably a simpler way to describe this than how I've said.
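To illustrate the difference described above, here is a minimal pandas sketch of what consolidating a scalar `value` column and a list-like `values` column into a single `value` column could look like. Only the `value` and `values` column names come from this thread; the `computation` column, the sample numbers, and the `consolidate` helper are hypothetical, and this is not the package's implementation.

```python
import pandas as pd

# Hypothetical rows: a scalar statistic (median) fills `value`,
# while percentiles arrive as a list in `values`.
nested = pd.DataFrame(
    {
        "computation": ["median", "percentile"],
        "value": [12.0, None],
        "values": [None, [5.0, 10.0, 20.0]],
    }
)


def consolidate(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse `value`/`values` into one `value` column, one row per number."""
    out = df.copy()
    # Wrap scalar rows as one-element lists so every row holds a list.
    out["value"] = [
        vals if isinstance(vals, list) else [val]
        for val, vals in zip(out["value"], out["values"])
    ]
    # Drop the list column and give each list element its own row.
    return out.drop(columns="values").explode("value", ignore_index=True)


flat = consolidate(nested)
print(flat)
```

In this sketch the nested frame mimics the shape described for expand_percentiles = False with 'percentiles' plus 'median' requested, and `flat` mimics the single-column shape described for expand_percentiles = True.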
Good idea, I have added some information about this.
Looks good! I wasn't saying you should change the function names to match dR, just that they were different.
The read_stats_por and read_stats_daterange naming convention was to make it easier for tab-completion (i.e., someone types read_stats then tab to see the two options appear).
I think it's a good change. The same sort of thing can be applied in python. It's nice to be consistent. Now to deal with this sudden ubuntu failure, ugh.
|
Hi @ehinman, thanks for including me on this. I've looked over the code, but I'd also like to run the unit tests. I'm unfamiliar with testing python packages, so what's the best way to go about that? |
Thanks Jeffrey! Let's see, you'll want to make sure you have the branch version |
I probably should figure out how to use |
|
Ok, I was having a little difficulty installing things correctly to run
dataretrieval-python % pixi add geopandas
Error: × failed to solve the pypi requirements of environment 'default' for platform 'osx-arm64'
├─▶ failed to resolve pypi dependencies
╰─▶ Because you require pandas>=2.0.0,<3.0.0 and pandas==3.0.1, we can conclude that your requirements are unsatisfiable.
help: The following PyPI packages have been pinned by the conda solve, and this version may be causing a conflict:
pandas==3.0.1
See https://pixi.sh/latest/concepts/conda_pypi/#pinned-package-conflicts for more information.
I had to remove the
dataretrieval-python % pytest -vv tests/waterdata_test.py
========================================================================================================== test session starts ===========================================================================================================
platform darwin -- Python 3.14.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/jkwang/Desktop/data-ret-test/dataretrieval-python/.pixi/envs/default/bin/python3.14
cachedir: .pytest_cache
rootdir: /Users/jkwang/Desktop/data-ret-test/dataretrieval-python
configfile: pyproject.toml
collected 25 items
tests/waterdata_test.py::test_mock_get_samples ERROR [ 4%]
tests/waterdata_test.py::test_check_profiles PASSED [ 8%]
tests/waterdata_test.py::test_samples_results PASSED [ 12%]
tests/waterdata_test.py::test_samples_activity PASSED [ 16%]
tests/waterdata_test.py::test_samples_locations PASSED [ 20%]
tests/waterdata_test.py::test_samples_projects PASSED [ 24%]
tests/waterdata_test.py::test_samples_organizations PASSED [ 28%]
tests/waterdata_test.py::test_get_daily PASSED [ 32%]
tests/waterdata_test.py::test_get_daily_properties PASSED [ 36%]
tests/waterdata_test.py::test_get_daily_properties_id PASSED [ 40%]
tests/waterdata_test.py::test_get_daily_no_geometry PASSED [ 44%]
tests/waterdata_test.py::test_get_continuous FAILED [ 48%]
tests/waterdata_test.py::test_get_monitoring_locations PASSED [ 52%]
tests/waterdata_test.py::test_get_monitoring_locations_hucs PASSED [ 56%]
tests/waterdata_test.py::test_get_latest_continuous FAILED [ 60%]
tests/waterdata_test.py::test_get_latest_daily PASSED [ 64%]
tests/waterdata_test.py::test_get_latest_daily_properties_geometry PASSED [ 68%]
tests/waterdata_test.py::test_get_field_measurements PASSED [ 72%]
tests/waterdata_test.py::test_get_time_series_metadata PASSED [ 76%]
tests/waterdata_test.py::test_get_reference_table PASSED [ 80%]
tests/waterdata_test.py::test_get_reference_table_with_query PASSED [ 84%]
tests/waterdata_test.py::test_get_reference_table_wrong_name PASSED [ 88%]
tests/waterdata_test.py::test_get_por_stats PASSED [ 92%]
tests/waterdata_test.py::test_get_por_stats_expanded_false PASSED [ 96%]
tests/waterdata_test.py::test_get_date_range_stats PASSED [100%]
================================================================================================================= ERRORS =================================================================================================================
________________________________________________________________________________________________ ERROR at setup of test_mock_get_samples _________________________________________________________________________________________________
file /Users/jkwang/Desktop/data-ret-test/dataretrieval-python/tests/waterdata_test.py, line 31
def test_mock_get_samples(requests_mock):
E fixture 'requests_mock' not found
> available fixtures: cache, capfd, capfdbinary, caplog, capsys, capsysbinary, capteesys, doctest_namespace, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, subtests, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
> use 'pytest --fixtures [testpath]' for help on them.
/Users/jkwang/Desktop/data-ret-test/dataretrieval-python/tests/waterdata_test.py:31
================================================================================================================ FAILURES ================================================================================================================
__________________________________________________________________________________________________________ test_get_continuous ___________________________________________________________________________________________________________
def test_get_continuous():
df,_ = get_continuous(
monitoring_location_id="USGS-06904500",
parameter_code="00065",
time="2025-01-01/2025-12-31"
)
assert isinstance(df, DataFrame)
assert "geometry" not in df.columns
assert df.shape[1] == 11
> assert df['time'].dtype == 'datetime64[ns, UTC]'
E AssertionError: assert datetime64[us, UTC] == 'datetime64[ns, UTC]'
E + where datetime64[us, UTC] = 0 2025-01-01 00:00:00+00:00\n1 2025-01-01 00:15:00+00:00\n2 2025-01-01 00:30:00+00:00\n3 2025-01-01 00:45:00+00:00\n4 2025-01-01 01:00:00+00:00\n ... \n34525 2025-12-30 23:00:00+00:00\n34526 2025-12-30 23:15:00+00:00\n34527 2025-12-30 23:30:00+00:00\n34528 2025-12-30 23:45:00+00:00\n34529 2025-12-31 00:00:00+00:00\nName: time, Length: 34530, dtype: datetime64[us, UTC].dtype
tests/waterdata_test.py:179: AssertionError
_______________________________________________________________________________________________________ test_get_latest_continuous _______________________________________________________________________________________________________
def test_get_latest_continuous():
df, md = get_latest_continuous(
monitoring_location_id=["USGS-05427718", "USGS-05427719"],
parameter_code=["00060", "00065"]
)
assert "latest_continuous_id" == df.columns[-1]
assert df.shape[0] <= 4
assert df.statistic_id.unique().tolist() == ["00011"]
assert hasattr(md, 'url')
assert hasattr(md, 'query_time')
> assert df['time'].dtype == 'datetime64[ns, UTC]'
E AssertionError: assert datetime64[us, UTC] == 'datetime64[ns, UTC]'
E + where datetime64[us, UTC] = 0 2026-02-24 14:00:00+00:00\n1 2026-02-24 14:00:00+00:00\nName: time, dtype: datetime64[us, UTC].dtype
tests/waterdata_test.py:207: AssertionError
============================================================================================================ warnings summary ============================================================================================================
dataretrieval/__init__.py:9
/Users/jkwang/Desktop/data-ret-test/dataretrieval-python/dataretrieval/__init__.py:9: DeprecationWarning: The 'nwis' services are deprecated and being decommissioned. Please use the 'waterdata' module to access the new services.
from dataretrieval.nwis import *
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================================================== short test summary info =========================================================================================================
FAILED tests/waterdata_test.py::test_get_continuous - AssertionError: assert datetime64[us, UTC] == 'datetime64[ns, UTC]'
+ where datetime64[us, UTC] = 0 2025-01-01 00:00:00+00:00\n1 2025-01-01 00:15:00+00:00\n2 2025-01-01 00:30:00+00:00\n3 2025-01-01 00:45:00+00:00\n4 2025-01-01 01:00:00+00:00\n ... \n34525 2025-12-30 23:00:00+00:00\n34526 2025-12-30 23:15:00+00:00\n34527 2025-12-30 23:30:00+00:00\n34528 2025-12-30 23:45:00+00:00\n34529 2025-12-31 00:00:00+00:00\nName: time, Length: 34530, dtype: datetime64[us, UTC].dtype
FAILED tests/waterdata_test.py::test_get_latest_continuous - AssertionError: assert datetime64[us, UTC] == 'datetime64[ns, UTC]'
+ where datetime64[us, UTC] = 0 2026-02-24 14:00:00+00:00\n1 2026-02-24 14:00:00+00:00\nName: time, dtype: datetime64[us, UTC].dtype
ERROR tests/waterdata_test.py::test_mock_get_samples
=========================================================================================== 2 failed, 22 passed, 1 warning, 1 error in 16.10s ============================================================================================ |
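Both failures above compare the `time` column's dtype against the exact string 'datetime64[ns, UTC]', while the environment (apparently pandas 3.0.1, going by the solver message) returned microsecond resolution. One possible way to make such an assertion independent of the resolution is sketched below. The helper name `is_utc_datetime` is hypothetical and not part of the test suite, and the sample timestamps are made up for illustration.

```python
import pandas as pd


def is_utc_datetime(series: pd.Series) -> bool:
    """True for a tz-aware datetime column in UTC, whatever its resolution."""
    dtype = series.dtype
    return isinstance(dtype, pd.DatetimeTZDtype) and str(dtype.tz) == "UTC"


times = pd.Series(
    pd.to_datetime(["2025-01-01T00:00:00Z", "2025-01-01T00:15:00Z"], utc=True)
)
# Same timestamps stored at nanosecond and microsecond resolution.
ns_times = times.astype("datetime64[ns, UTC]")
us_times = times.astype("datetime64[us, UTC]")

print(is_utc_datetime(ns_times), is_utc_datetime(us_times))
```

A check along these lines would accept both `datetime64[ns, UTC]` and `datetime64[us, UTC]`, so the test would not depend on which resolution pandas chooses by default.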
|
@jeffskwang-usgs, thanks for running these tests on your machine! I believe the first error is due to the fact that you do not have all the modules installed to run the tests, namely |
|
That's right, I was able to get that part to pass after using
dataretrieval-python % pixi add pytest requests-mock
WARN The package `pytest-cov==7.0.0` does not have an extra named `all`
✔ Added pytest >=9.0.2,<10
✔ Added requests-mock >=1.12.1,<2
dataretrieval-python % pytest -vv tests/waterdata_test.py
===================================================================================== test session starts ======================================================================================
platform darwin -- Python 3.14.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/jkwang/Desktop/data-ret-test/dataretrieval-python/.pixi/envs/default/bin/python3.14
cachedir: .pytest_cache
rootdir: /Users/jkwang/Desktop/data-ret-test/dataretrieval-python
configfile: pyproject.toml
plugins: requests-mock-1.12.1
collected 25 items
tests/waterdata_test.py::test_mock_get_samples PASSED |
Adds in two functions that query the two endpoints at: https://api.waterdata.usgs.gov/statistics/v0/docs
Also adds utility functions for parsing and organizing the JSON response. Water Data API functions could be further edited to include the stats API functions, but for now I kept them separate.
To do (1/8/26):